1 Introduction

This paper describes our project on housing prices in Melbourne, Australia, using sales data from 2017. A reliable price prediction system supports the healthy development of the real estate industry: a simple method for predicting and drawing inferences about housing prices helps businesses determine fair prices and helps governments assess property taxes. This project aims to learn how different factors affect home sale prices by building linear models. Although the data were collected in Melbourne in 2017, the idea that location and home attributes correlate with housing prices could reasonably apply more broadly, including internationally.

This project focuses on the following questions:

  1. Can housing prices in Melbourne, Australia be predicted using this dataset?
  2. Which variables have the greatest impact on housing price?
  3. How do the location, seller, and construction attributes of homes affect the housing market in Melbourne, Australia?

2 Exploratory Data Analysis (EDA)

The following are excerpts and graphs from our exploratory data analysis. This part of the project familiarizes the reader with our dataset’s attributes and lays the foundation for the variables we will include in our linear model. The results of our EDA will also inform the future direction of the project.

2.1 The Melbourne Housing Snapshot Dataset

  • Home Sales in 2017
    • Location
    • Construction
    • Sale
  • Variables: 21
    • Numeric: 12
    • Categorical: 9

2.2 The Variables

Rooms: Number of rooms

Price: Sale price (AUD)

Method: Method of sale - 5 categories

Type: House, Unit, Townhouse - 3 categories

SellerG: Real Estate Agent - 268 categories

Date: Date sold

Distance: Distance from Central Business District

Regionname: Region name - 8 categories

Propertycount: Number of properties that exist in the suburb

2.3 More Variables

Bedroom2: Number of bedrooms

Bathroom: Number of bathrooms

Car: Number of car spaces

Landsize: Land Size

BuildingArea: Building Size

YearBuilt: Year home built

CouncilArea: Governing council for the area - 34 categories

Lattitude, Longtitude: GPS location

Suburb: Suburb name - 314 categories

2.4 Goals

  1. Understand which attributes of a home and its sale determine final sale price
  2. Attempt to build a reasonable model for inference and/or prediction for final sale price

2.5 Summary of Price Statistics

Mean: $1,075,684

SD: $639,311

data.full$Price
Min 85000
Q1 650000
Median 903000
Mean 1075684
Q3 1330000
Max 9000000
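The summary above can be reproduced with base R. The following is a minimal sketch: the name `data.full` comes from the output above, but the prices here are simulated stand-ins (an assumption made only so the block runs on its own), so the printed numbers will not match the Melbourne figures.

```r
# Simulated stand-in for data.full (NOT the Melbourne data).
set.seed(1)
data.full <- data.frame(Price = round(rlnorm(500, meanlog = 13.7, sdlog = 0.5)))

# Five-number summary plus the mean: Min, Q1, Median, Mean, Q3, Max.
price.summary <- summary(data.full$Price)

# Sample standard deviation of the sale price.
price.sd <- sd(data.full$Price)

print(price.summary)
print(price.sd)
```

With the real data loaded into `data.full`, only the two lines that compute `price.summary` and `price.sd` would be needed.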

2.6 Select Data Pairs

2.7 Correlations

2.8 Map of Melbourne Sales

2.9 Selling Price

2.10 Log Selling Price
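As a rough illustration of why the log of the selling price is worth examining, the sketch below compares the skewness of a price variable before and after a log transform and draws both histograms. The data are simulated right-skewed values (an assumption, not the Melbourne sales), and `skewness` is a small helper defined here rather than a function from the original analysis.

```r
# Simulated right-skewed prices (NOT the Melbourne data).
set.seed(2)
data.full <- data.frame(Price = rlnorm(1000, meanlog = 13.7, sdlog = 0.5))
data.full$LogPrice <- log(data.full$Price)

# Hypothetical helper: moment-based sample skewness; values near 0 suggest symmetry.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew.raw <- skewness(data.full$Price)
skew.log <- skewness(data.full$LogPrice)
print(c(raw = skew.raw, log = skew.log))

# Side-by-side histograms of the raw and log-transformed price.
op <- par(mfrow = c(1, 2))
hist(data.full$Price, breaks = 40, main = "Selling Price", xlab = "Price")
hist(data.full$LogPrice, breaks = 40, main = "Log Selling Price", xlab = "log(Price)")
par(op)
```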

2.11 Price by Region

2.12 Price by Number of Rooms (<10 Rooms)

2.13 Price by Type of Home

2.14 Test of Independence by Group (Pearson \(\chi^2\))

2.14.1 Type, Rooms, Regionname, SellerG

\(H_0\): Mean price is equal across groups

All four tests reject \(H_0\) with p-value \(< 2\times 10^{-16}\)
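The source does not show the call used for these tests. As one possible way to test a null of equal group means, the sketch below swaps in two base R procedures: Welch's one-way ANOVA (`oneway.test`) and the rank-based Kruskal-Wallis test (`kruskal.test`), which reports a chi-squared statistic. The data frame `toy` and its values are simulated assumptions, not the Melbourne data.

```r
# Simulated prices for three home types (NOT the Melbourne data).
set.seed(3)
toy <- data.frame(
  Type  = factor(rep(c("House", "Townhouse", "Unit"), each = 100)),
  Price = c(rnorm(100, 1.2e6, 3e5), rnorm(100, 9e5, 2e5), rnorm(100, 6e5, 1.5e5))
)

# Welch one-way ANOVA: H0 is that all group means are equal
# (does not assume equal variances across groups).
anova.fit <- oneway.test(Price ~ Type, data = toy)

# Kruskal-Wallis rank test: a nonparametric alternative whose
# statistic is referred to a chi-squared distribution.
kw.fit <- kruskal.test(Price ~ Type, data = toy)

print(anova.fit$p.value)
print(kw.fit$p.value)
```

The same two calls could be repeated with `Rooms`, `Regionname`, or `SellerG` as the grouping factor.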

2.15 Price by Region and Type

2.16 Transform Data - Homogeneity

2.17 Transform Data - Normality

3 Linear Modelling

3.1 First Attempt at Linear Model

3.2 Linear Model 2: Removed the Variable with Highest VIF
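The variance inflation factor for predictor \(j\) is \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing that predictor on all the others. The sketch below computes it by hand in base R and picks the predictor with the largest value. The data are simulated, and the choice of `Rooms`, `Bedroom2`, and `Distance` is only illustrative; the source does not say which variable was removed or which function was used.

```r
# Simulated predictors (NOT the Melbourne data); Bedroom2 is built to be
# nearly collinear with Rooms so that the VIFs are visibly inflated.
set.seed(4)
n <- 300
Rooms    <- rpois(n, 3) + 1
Bedroom2 <- Rooms + sample(c(-1, 0, 0, 0), n, replace = TRUE)
Distance <- runif(n, 1, 40)
toy <- data.frame(Rooms, Bedroom2, Distance)

# VIF_j = 1 / (1 - R_j^2): regress each predictor on the remaining ones.
vif.manual <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})
print(round(vif.manual, 2))

# Candidate for removal: the predictor with the largest VIF.
drop.var <- names(which.max(vif.manual))
print(drop.var)
```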

3.3 Model Coefficients

3.4 Linear Model 3: Considered Interactions

3.5 Linear Model 4: Removed Land Size

3.6 Model 4 Coefficients

4 Residual Analysis

4.1 Homogeneity? No

4.2 Normal? Not Quite

4.3 Influence? Yes

4.4 Remove Influence Points
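One common way to flag and remove influential observations is Cook's distance. The sketch below is an illustration under stated assumptions: the data are simulated with one planted extreme sale, and the cutoff of \(4/n\) is a widely used rule of thumb, not necessarily the criterion applied in this project.

```r
# Simulated data (NOT the Melbourne data) with one planted extreme sale.
set.seed(5)
n <- 200
toy <- data.frame(Rooms = sample(1:6, n, replace = TRUE))
toy$Price <- 3e5 + 2.5e5 * toy$Rooms + rnorm(n, sd = 1e5)
toy$Rooms[1] <- 1
toy$Price[1] <- 9e6

fit <- lm(Price ~ Rooms, data = toy)

# Cook's distance measures how much the fitted values change
# when a single observation is left out.
cooks  <- cooks.distance(fit)
cutoff <- 4 / nrow(toy)               # rule-of-thumb threshold (assumption)
influential <- which(cooks > cutoff)

# Keep only the rows at or below the cutoff, then refit.
toy.clean <- toy[cooks <= cutoff, ]
fit.clean <- lm(Price ~ Rooms, data = toy.clean)

print(length(influential))
print(coef(fit.clean))
```

Filtering with a logical condition rather than negative indices avoids accidentally dropping every row when no point exceeds the cutoff.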

5 Proposed Model

## 
## Call:
## lm(formula = Price ~ Rooms + Distance + Bathroom + Car + BuildingArea + 
##     Lattitude + Longtitude + Propertycount + factor(Regionname), 
##     data = data.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2399657  -259926   -40715   178350  8109450 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                  -1.41e+08   1.79e+07   -7.87
## Rooms                                         3.21e+05   8.97e+03   35.73
## Distance                                     -4.44e+04   1.52e+03  -29.15
## Bathroom                                      1.71e+05   1.15e+04   14.94
## Car                                           5.59e+04   7.79e+03    7.17
## BuildingArea                                  4.21e+01   1.04e+01    4.04
## Lattitude                                    -7.82e+05   1.29e+05   -6.04
## Longtitude                                    7.70e+05   1.21e+05    6.37
## Propertycount                                -3.64e+00   1.58e+00   -2.30
## factor(Regionname)Eastern Victoria            1.65e+05   1.07e+05    1.54
## factor(Regionname)Northern Metropolitan      -5.90e+04   3.13e+04   -1.88
## factor(Regionname)Northern Victoria           5.40e+05   1.21e+05    4.47
## factor(Regionname)South-Eastern Metropolitan  1.54e+05   5.34e+04    2.89
## factor(Regionname)Southern Metropolitan       2.27e+05   2.83e+04    8.03
## factor(Regionname)Western Metropolitan       -7.28e+04   4.02e+04   -1.81
## factor(Regionname)Western Victoria            5.07e+05   1.40e+05    3.62
##                                              Pr(>|t|)    
## (Intercept)                                   4.3e-15 ***
## Rooms                                         < 2e-16 ***
## Distance                                      < 2e-16 ***
## Bathroom                                      < 2e-16 ***
## Car                                           8.4e-13 ***
## BuildingArea                                  5.5e-05 ***
## Lattitude                                     1.7e-09 ***
## Longtitude                                    2.0e-10 ***
## Propertycount                                  0.0212 *  
## factor(Regionname)Eastern Victoria             0.1229    
## factor(Regionname)Northern Metropolitan        0.0599 .  
## factor(Regionname)Northern Victoria           8.0e-06 ***
## factor(Regionname)South-Eastern Metropolitan   0.0039 ** 
## factor(Regionname)Southern Metropolitan       1.2e-15 ***
## factor(Regionname)Western Metropolitan         0.0701 .  
## factor(Regionname)Western Victoria             0.0003 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 452000 on 4942 degrees of freedom
##   (4548 observations deleted due to missingness)
## Multiple R-squared:  0.531,  Adjusted R-squared:  0.53 
## F-statistic:  373 on 15 and 4942 DF,  p-value: <2e-16

5.1 Testing \(R^2\)

\[ R^2 = 1 - \dfrac{RSS}{TSS} = 0.441 \]
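The test \(R^2\) formula above can be computed on held-out data as sketched below. Several things here are assumptions made for illustration: the data are simulated, the 70/30 split and the name `data.test` are hypothetical (only `data.train` appears in the model output), the model uses just two predictors, and TSS is taken around the test-set mean.

```r
# Simulated data (NOT the Melbourne data).
set.seed(6)
n <- 400
toy <- data.frame(Rooms = sample(1:6, n, replace = TRUE),
                  Distance = runif(n, 1, 40))
toy$Price <- 4e5 + 3e5 * toy$Rooms - 2e4 * toy$Distance + rnorm(n, sd = 3e5)

# Hypothetical 70/30 train/test split.
train.idx  <- sample(n, size = round(0.7 * n))
data.train <- toy[train.idx, ]
data.test  <- toy[-train.idx, ]

fit  <- lm(Price ~ Rooms + Distance, data = data.train)
pred <- predict(fit, newdata = data.test)

# R^2 = 1 - RSS/TSS, evaluated on the test set.
rss <- sum((data.test$Price - pred)^2)
tss <- sum((data.test$Price - mean(data.test$Price))^2)
r2.test <- 1 - rss / tss
print(r2.test)
```

A test \(R^2\) below the training \(R^2\), as with 0.441 against 0.531 here, suggests the model fits the training data somewhat better than it generalizes.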

6 Future Work

  • Further explore log transformation
  • Consider GLM with log link
  • Decide how to handle factors with many levels (hundreds)
  • Address missing data
  • Improve prediction
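As a starting point for the GLM item above, the sketch below fits a generalized linear model with a log link in base R. The Gamma family is an assumption chosen because prices are positive and right-skewed; the list above does not name a family, and `gaussian(link = "log")` would be another option. The data and coefficients are simulated for illustration only.

```r
# Simulated positive, right-skewed prices (NOT the Melbourne data).
set.seed(7)
n <- 300
toy <- data.frame(Rooms = sample(1:6, n, replace = TRUE),
                  Distance = runif(n, 1, 40))
mu <- exp(13 + 0.25 * toy$Rooms - 0.02 * toy$Distance)
toy$Price <- mu * exp(rnorm(n, sd = 0.2))

# Log link: log(E[Price]) is linear in the predictors, so each
# coefficient acts multiplicatively on the expected price.
fit.glm <- glm(Price ~ Rooms + Distance,
               family = Gamma(link = "log"), data = toy)

print(coef(fit.glm))

# Multiplicative effect on expected price of one additional room.
print(exp(coef(fit.glm)[["Rooms"]]))

# Predictions on the original price scale.
pred <- predict(fit.glm, type = "response")
```

Unlike a linear model fitted to `log(Price)`, this approach models the mean price directly, so predictions need no back-transformation correction.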

7 Conclusion

8 Bibliography